Julia vs. Other Technical Computing Languages

Why is vectorized code fast ? And, why is not as fast as it could be ?

Generators, Comprehensions, Reduction, Broadcasting, Syntactic Loop Fusion, ...

Author: Ján Dolinský, Tangent Works

Generators

Matlab/R Vectorized Way



In [ ]:

    
using BenchmarkTools



In [ ]:

    
a, f = rand(10^5), rand(10^5);



In [ ]:

    
sizeof(a)



In [ ]:

    
poly_R(x) = 3x.^2 + 2x - sqrt.(x) - 1



In [ ]:

    
@btime poly_R(a);

Conclusion:
7 * 800_000 bytes = 5_600_000 bytes (approx. 5.34MiB)
The code in function poly_R() does 7 unnecessary allocations!

tmp1 = x.^2
tmp2 = 3*tmp1
tmp3 = 2x
tmp4 = sqrt.(x)
tmp5 = tmp2 + tmp3
tmp6 = tmp5 - tmp4
tmp7 = tmp6 - 1

Next: Let's rewrite poly_R() into poly() and use it as our kernel function.



In [ ]:

    
poly(x) = 3x^2 + 2x - √x - 1

Julia: Explicit Loop



In [ ]:

    
function poly_J_loop(v)
    u = similar(v)
    
    for i ∈ eachindex(v)
        u[i] = poly(v[i])
    end
    
    u
end



In [ ]:

    
@btime poly_J_loop(a);

Conclusion:
1 * 800_000 bytes = 800_000 bytes (approx. 781.34KiB)
Only a single allocation for target vector is made.

Julia: Generator & Comprehension



In [ ]:

    
poly_J(v) = [ poly(x) for x in v ]



In [ ]:

    
poly_J_alternative(v) = [ 3x^2 + 2x - √x - 1 for x in v ]



In [ ]:

    
@btime poly_J(a);

Conclusion:
1 * 800_000 bytes = 800_000 bytes (approx. 781.34KiB)
Only a single allocation for target vector is made.

Julia: Map()



In [ ]:

    
poly_J_map(v) = map(poly, v)



In [ ]:

    
poly_J_map_alternative(v) = map(x -> 3x^2 +2x - √x - 1, v)



In [ ]:

    
@btime poly_J_map(a);

Julia: Loop fusion



In [ ]:

    
@btime poly.(a);

Matlab/R Vectorized Way



In [ ]:

    
polysum_R(v) = sum(poly_R(v))



In [ ]:

    
@btime polysum_R(a)

Julia way



In [ ]:

    
@btime sum(poly, a)

Julia way: Explicit mapreduce()



In [ ]:

    
@btime mapreduce(poly, +, a)



In [ ]:

    
@btime mean(poly, a)

Matlab/R Vectorized Way



In [ ]:

    
mse_R(a, f) = mean((a - f).^2)



In [ ]:

    
sizeof(a)



In [ ]:

    
@btime mse_R(a, f)

Conclusion:
2 * 800_000 bytes = 1_600_000 bytes (approx. 1.53MiB)

tmp1 = a - f
tmp2 = tmp1.^2

Julia way



In [ ]:

    
mse_J(a, f) = mean( (a[i] - f[i])^2 for i ∈ eachindex(a) )



In [ ]:

    
@btime mse_J(a, f)



In [ ]:

    
mse_J_zip_alternative(a, f) = mean( (x - y)^2 for (x, y) ∈ zip(a, f) )



In [ ]:

    
@btime mse_J_zip_alternative(a, f)

Matlab/R way



In [ ]:

    
mape_R(a, f) = 100 * mean(abs.((a - f) ./ a))



In [ ]:

    
@btime mape_R(a, f)



In [ ]:

    
mape_J(a, f) = 100 * mean( abs((a[i] - f[i]) / a[i]) for i ∈ eachindex(a) if a[i] != 0 )



In [ ]:

    
@btime mape_J(a, f)



In [ ]:

    
mape_J_zip_alternative(a, f) = 100 * mean( abs((x - y) / x) for (x, y) ∈ zip(a, f) if x != 0 )



In [ ]:

    
@btime mape_J_zip_alternative(a, f)

Element-by-element operation on arrays of different sizes.

Example 1: Mean correction



In [ ]:

    
m = [fill(1.0, 10) fill(2.0, 10) fill(3.0, 10)]



In [ ]:

    
μ = mean(m, 1)



In [ ]:

    
m .- μ



In [ ]:

    
broadcast((x, y) -> x - y, m , μ)

Example 2: Z-Score



In [ ]:

    
m = rand(10^3, 3);



In [ ]:

    
_zscore(x, μ, σ) = (x - μ) / σ



In [ ]:

    
function zscore(m)
    μ = mean(m)
    σ = std(m)
    
    _zscore.(m , μ, σ)
end



In [ ]:

    
@time zscore(m);

Challenge
Implement Rank-1 update and explain the problem with a Matlab/R implementation of such a problem



In [ ]:

    
A, u, v = ones(10^3, 10^2), fill(2.0, 10^3), fill(3.0, 10^2)



In [ ]:

    
sizeof(A)



In [ ]:

    
rank1_naive(A, u, v) = A - u*v'



In [ ]:

    
@time rank1_naive(A, u, v)



In [ ]:

    
_rank1(x, y, z) = x - y * z



In [ ]:

    
rank1_J_1(A, u, v) = _rank1.(A, u, v')



In [ ]:

    
@time rank1_J_1(A, u, v);



In [ ]:

    
rank1_J_2(A, u, v) = A .- u.*v'



In [ ]:

    
@time rank1_J_2(A, u, v);



In [ ]:

    
rank1_J_3(A, u, v) = broadcast(_rank1, A, u, v')



In [ ]:

    
@time rank1_J_3(A, u, v)